Effects of Stop Words Elimination for Arabic Information Retrieval: A Comparative Study

Author

  • Ibrahim Abu El-Khair
Abstract

The effectiveness of three stop word lists for Arabic information retrieval (a General Stoplist, a Corpus-Based Stoplist, and a Combined Stoplist) was investigated in this study. Three popular weighting schemes were examined: the inverse document frequency weight, probabilistic weighting, and statistical language modelling. The idea is to combine the statistical approaches with linguistic approaches to reach optimal performance, and to compare their effect on retrieval. The LDC (Linguistic Data Consortium) Arabic Newswire data set was used with the Lemur Toolkit. The Best Match weighting scheme (BM25) used in the Okapi retrieval system had the best overall performance of the three weighting algorithms used in the study. Stoplists improved retrieval effectiveness, especially when used with the BM25 weight. The overall performance of a general stoplist was better than the other ...
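As a rough illustration of the setup the abstract describes, the sketch below removes stop words before scoring documents with a common textbook variant of BM25. It is not the study's implementation: the study used the Lemur Toolkit, whereas this is plain Python, and the stoplist, the function names `tokenize` and `bm25_scores`, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy stoplist (three Arabic prepositions); the study's
# actual stoplists are not reproduced here.
STOPLIST = frozenset({"في", "من", "على"})


def tokenize(text, stoplist=frozenset()):
    """Split on whitespace, then drop any token found in the stoplist."""
    return [tok for tok in text.split() if tok not in stoplist]


def bm25_scores(query, docs, stoplist=frozenset(), k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 variant.

    k1 and b are conventional default values, assumed here rather than
    taken from the study.
    """
    tokenized = [tokenize(d, stoplist) for d in docs]
    if not tokenized:
        return []
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = tokenize(query, stoplist)

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = ["في المدينة مكتبة", "من المكتبة كتاب", "على الطاولة كتاب"]
    print(bm25_scores("كتاب في", docs))            # no stoplist
    print(bm25_scores("كتاب في", docs, STOPLIST))  # with stoplist
```

With the stoplist applied, the first document no longer matches the query through the preposition alone, which is the kind of effect a stoplist comparison measures.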

Similar resources

Effects of Stop Words Elimination for AIR

The effectiveness of three stop word lists for Arabic information retrieval (a General Stoplist, a Corpus-Based Stoplist, and a Combined Stoplist) was investigated in this study. Three popular weighting schemes were examined: the inverse document frequency weight, probabilistic weighting, and statistical language modelling. The idea is to combine the statistical approaches with linguistic approaches ...

Query Term Selection Strategies for Web-based Chinese Factoid Question Answering

Passage retrieval plays an important role in a Chinese factoid Question Answering (QA) system. Query term selection is the process of choosing keywords from a given question to make the most of information retrieval engines. Query terms selected by humans are analyzed to measure the difficulty of the task and to evaluate machine-generated results. Three approaches, namely stop words elimination, rul...

A Study of the Effects of Stemming on Information Retrieval in the Persian Language

Using language-specific behavior in information retrieval systems can significantly improve the quality of the retrieved results. The part of a word that remains after removing its affixes is called the stem. Stemming can be used to improve the relevance of the results in an information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...

Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

In the information era, optimizing the processes behind Information Retrieval, Text Summarization, and Text and Data Analytics systems becomes of utmost importance. Therefore, to achieve accuracy, redundant words with little or no semantic meaning must be filtered out. Such words are known as stopwords. Stopword lists have been developed for languages like English, Chinese, Arabic, Hindi...

Plagiarism Detection In Arabic Scripts Using Fuzzy Information Retrieval

The structure of the Arabic language exposes the need for fuzzy or vague concepts to reveal dishonest practices in Arabic documents. In this paper, we present a statement-based plagiarism detection approach for Arabic scripts using a fuzzy-set IR model. The degree of similarity is calculated and compared to a threshold value to judge whether two statements are the same or different. Our corpus c...

Journal title:
  • CoRR

Volume abs/1702.01925

Publication date 2008